This project began as an extension of a DataCamp course on business analytics using Microsoft Excel. The course used an Excel Workbook with global sales and client data for athletic wear and taught how to create basic graphs and tables in Excel. After completing the course, I challenged myself to see if I could recreate the graphs and tables using RStudio, which is my analytics tool of choice.
After completing this challenge, I went a step further and added to my analytics by creating interactive graphs and global heat maps of sales to add even more analytics and insight to a potential colleague, boss, or investor. Below are the various graphs and tables with ample descriptions of what I created and why I thought it was important.
The data used for the following analytics comes from a DataCamp course on Microsoft Excel, which can be found here. I completed the first chapter of the course, which walks through data cleaning, sheet renaming, and sheet reorganization, which is necessary to complete prior to following along with my analytics below (the first chapter of the DataCamp course is free to complete). After completing the course, I decided to load in the “Orders” sheet from the DataCamp workbook to conduct my analytics.
# load in pacman
library(pacman)
# load in necessary packages
p_load(readxl,janitor,ggplot2,plotly,ggthemes,dplyr)
# load in orders sheet from DataCamp Workbook
sales_data <- read_excel("~/github_repos/projects-and-work-samples/athletic_wear_sales_analytics_project_files/data/sales_data.xlsx",
sheet = "Orders")
sales_data## # A tibble: 1,269 × 24
## Order I…¹ Order…² `Order Date` Days …³ Late_…⁴ Shipp…⁵ Categ…⁶ Custo…⁷
## <dbl> <dbl> <dttm> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 6176 2015 2015-01-04 00:00:00 4 0 Standa… 17 3329
## 2 10384 2015 2015-01-06 00:00:00 4 1 Standa… 17 587
## 3 14551 2015 2015-01-08 00:00:00 2 0 Second… 17 2028
## 4 14574 2015 2015-01-08 00:00:00 2 1 Second… 18 6594
## 5 22924 2015 2015-01-12 00:00:00 2 1 Second… 29 9704
## 6 906 2015 2015-01-14 00:00:00 4 0 Standa… 40 7141
## 7 973 2015 2015-01-15 00:00:00 4 0 Standa… 29 5118
## 8 1077 2015 2015-01-16 00:00:00 2 1 Second… 9 8103
## 9 1105 2015 2015-01-17 00:00:00 4 1 Standa… 37 9760
## 10 1186 2015 2015-01-18 00:00:00 4 0 Standa… 9 11947
## # … with 1,259 more rows, 16 more variables: `Department Id` <dbl>,
## # `Department Name` <chr>, Market <chr>, `Order City` <chr>,
## # `Order State` <chr>, `Order Zipcode` <dbl>, `Order Country` <chr>,
## # `Order Region` <chr>, `Product Category Id` <dbl>, `Product Id` <dbl>,
## # `Product Price` <dbl>, `Product Cost` <dbl>, `Order Quantity` <dbl>,
## # `Order Total Discount` <dbl>, Sales <dbl>, `Payment Type` <chr>, and
## # abbreviated variable names ¹`Order Id`, ²`Order Year`, …
The data above contains all sorts of categorical and numerical data
regarding individual orders of athletic wear. Lots of potential for
analytics! Next, I will clean the data a bit, focusing on the column
titles and making them more r-friendly (meaning lowercase letters and
underscores instead of spaces). I will use the janitor
package to quickly clean the column names using the
clean_names() function.
## # A tibble: 1,269 × 24
## order_id order_…¹ order_date days_…² late_…³ shipp…⁴ categ…⁵ custo…⁶
## <dbl> <dbl> <dttm> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 6176 2015 2015-01-04 00:00:00 4 0 Standa… 17 3329
## 2 10384 2015 2015-01-06 00:00:00 4 1 Standa… 17 587
## 3 14551 2015 2015-01-08 00:00:00 2 0 Second… 17 2028
## 4 14574 2015 2015-01-08 00:00:00 2 1 Second… 18 6594
## 5 22924 2015 2015-01-12 00:00:00 2 1 Second… 29 9704
## 6 906 2015 2015-01-14 00:00:00 4 0 Standa… 40 7141
## 7 973 2015 2015-01-15 00:00:00 4 0 Standa… 29 5118
## 8 1077 2015 2015-01-16 00:00:00 2 1 Second… 9 8103
## 9 1105 2015 2015-01-17 00:00:00 4 1 Standa… 37 9760
## 10 1186 2015 2015-01-18 00:00:00 4 0 Standa… 9 11947
## # … with 1,259 more rows, 16 more variables: department_id <dbl>,
## # department_name <chr>, market <chr>, order_city <chr>, order_state <chr>,
## # order_zipcode <dbl>, order_country <chr>, order_region <chr>,
## # product_category_id <dbl>, product_id <dbl>, product_price <dbl>,
## # product_cost <dbl>, order_quantity <dbl>, order_total_discount <dbl>,
## # sales <dbl>, payment_type <chr>, and abbreviated variable names
## # ¹order_year, ²days_for_shipment_scheduled, ³late_delivery_risk, …
Now the column headers are all lowercase and have underscores instead of spaces!
Now that the data is clean and easier to work with, I want to see how total annual sales have changed in certain markets. First I will look at the sales in Africa.
sales_data %>%
# filter for only observations in Africa
filter(market=="Africa") %>%
# group by order year so we can plot one point for each year
group_by(order_year) %>%
# create a sum of the 'sales' column after grouping by each year to obtain the total sales in each year
mutate(total_annual_sales = sum(sales)) %>%
# create graph
ggplot(mapping = aes(x = order_year,
y = total_annual_sales)) +
geom_point(color = "firebrick", size = 3) +
geom_line(color = "firebrick", linewidth = 1.5) +
# use 'ggthemes' package to add a professional looking theme to the graph
theme_stata() +
# add labels and title
labs(title = "Annual Sales in Africa",
x = "Year",
y = "Total Annual Sales (USD)")Looks like Africa only has two years worth of data, in 2016 and 2017,
and it’s clear that sales declined over that time period. I can’t tell
from this graph what the exact sales values are, but I can venture an
educated guess. If I want to interact with the graph, I can wrap the
ggplotly function from the plotly package
around my ggplot() call from above to see exactly what the
values are for each of the two data points.
ggplotly(sales_data %>%
# filter for only observations in Africa
filter(market=="Africa") %>%
# group by order year so we can plot one point for each year
group_by(order_year) %>%
# create a sum of the 'sales' column after grouping by each year to obtain the total sales in each year
mutate(total_annual_sales = sum(sales)) %>%
# create graph
ggplot(mapping = aes(x = order_year,
y = total_annual_sales)) +
geom_point(color = "firebrick", size = 3) +
geom_line(color = "firebrick", linewidth = 1.5) +
# use 'ggthemes' package to add a professional looking theme to the graph
theme_stata() +
# add labels and title
labs(title = "Annual Sales in Africa",
x = "Year",
y = "Total Annual Sales (USD)"))The graph looks exactly the same, but if you hover over the data points we can see the actual total sales amount for 2016 and 2017. Now I see that sales fell from $40,484.64 in 2016 to $ 6,477.34 in 2017.
Next, I want to look at the annual sales in Europe. First, I’ll start
with the standard ggplot() graph.
sales_data %>%
# filter for European observations
filter(market == "Europe") %>%
# group by 'order_year' to calculate total sales per each year
group_by(order_year) %>%
# Create total_sales value that is sum of sales in each year
mutate(total_annual_sales = sum(sales)) %>%
# create graph
ggplot(aes(x = order_year,
y = total_annual_sales)) +
geom_point(color = "steelblue3",
size = 3) +
geom_line(color = "steelblue3",
linewidth = 1.5) +
# add theme using 'ggthemes' package
theme_stata() +
# add labels
labs(title = "Annual Sales in Europe",
x = "Year",
y = "Total Annual Sales")Europe has one more years worth of data than Africa, with data from
2015-2017, and has additional volatility over that time period. Like I
did with the Africa graph, I will use the ggplotly()
function here to observe the sales values from each year.
ggplotly(sales_data %>%
# filter for European observations
filter(market == "Europe") %>%
# group by 'order_year' to calculate total sales per each year
group_by(order_year) %>%
# Create total_sales value that is sum of sales in each year
mutate(total_annual_sales = sum(sales)) %>%
# create graph
ggplot(aes(x = order_year,
y = total_annual_sales)) +
geom_point(color = "steelblue3",
size = 3) +
geom_line(color = "steelblue3",
size = 1.5) +
# add theme using 'ggthemes' package
theme_stata() +
# add labels
labs(title = "Annual Sales in Europe",
x = "Year",
y = "Total Annual Sales"))## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
With the interactive graph above, I see that sales in Europe were equivalent to $21,717.55 in 2015, decreasing to $2,964.55 in 2016, and rising finally to $36,172.24 in 2017.
The next analytics I am interested in conducting is a graph of all sales in each department to get a better idea as to which departments can generate the most revenue for this company per order. I will create a bar graph that visualizes this data, opting this time to only show the interactive plot so I can observe the exact sales values for each order from each department.
# wrap in ggplotly() to make interactive
ggplotly(sales_data %>%
# use reorder() in aes() function to put bars in desc. order (-sales means desc. order by sales variable)
ggplot(aes(department_name,
sales)) +
# add bars. "stat = 'identity'" allows me to change the y axis from count (default setting) to sales variable from data
geom_bar(stat = "identity", fill = "skyblue3", position = 'dodge') +
theme_stata() +
labs(title = "Revenue values by Department",
y = "Sales (USD)",
x = "Department")) This graph offers a lot in terms of actionable insights. By moving the cursor up inside of each bar, I can see the dollar value of each order from that particular department. Clearly, the technology department can offer the most in terms of revenue, as it has the most expensive orders in all the data, with fitness being a distant second. In order to make a decision on revenue, however, I will next create a bar graph detailing the total sales per department and see if the results are the same or different from this graph.
In order to tell which department is responsible for the most sales,
or revenue, I will need to calculate the total sales by department
first, then create a graph similar to the ones above, but using the
newly created total_sales value instead of the normal
sales value.
# create new column for total sales grouped by department
sales_data <-sales_data %>%
group_by(department_name) %>%
mutate(total_department_sales = sum(sales))
# create graph
ggplotly(
ggplot(sales_data,
aes(department_name,total_department_sales)) +
geom_bar(stat = 'identity',
position = 'dodge',
fill = 'palevioletred') +
theme_stata() +
labs(title = "Total Revenue by Department",
y = "Total Sales (USD)",
x = "Department")
)Now this graph tells a very different story from the previous one. In the previous graph, we saw that the technology department had the highest order values of any department, but this new graph shows that the technology department accounts for very little in terms of total revenue. Apparel is by far bringing in the most revenue, followed by Fitness, Footwear, and Golf all bringing in similar amounts of revenue. So, even though the technology department has some of the most expensive orders, these are a handful of orders that, when compared to the massive quantities of apparel orders, is dwarfed in terms of the amount of revenue generated.
Now that we have seen the individual and total sales values across departments, let’s add another element, the market, to see how departmental sales vary geographically. This will provide us with insight as to where geographically each department is generating the most revenue.
# wrap in ggplotly() to create interactive graph
ggplotly(
sales_data %>%
# group by market
group_by(market, department_name) %>%
# create total sales value by summing sales for each market
summarise(total_market_sales = sum(sales)) %>%
# graph total sales by market
ggplot(aes(x = market, y = total_market_sales, fill = department_name)) +
geom_bar(stat = "identity",
position = "dodge") +
labs(title = "Total Department Sales by Market",
y = "Total Sales (USD)",
x = "Market",
fill = "Department Name") +
scale_fill_colorblind() +
theme_stata()
)## `summarise()` has grouped output by 'market'. You can override using the
## `.groups` argument.
The last data visualization I will create is a heat map to visualize where the most orders are coming from around the world. This will be informative as it will give me an idea of what countries in each market are doing the most business with this athletic wear company.
The first step in this process is to create a new column in the data set that contains the total number of orders for each country. In this case, every single observation (or row) in the original data set is a unique order, so all I need to do is group by each country and create a new column whose value is equal to the number of rows each country has.
sales_data <- sales_data %>%
group_by(order_country) %>%
mutate(total_orders = n_distinct(order_id))Next, I will use the rnaturalearth package to extract
countries map data to use in constructing the map. I will save the
countries data as a new object, which I will then merge with the sales
data to have one data set with all the information I need.
Now that I have the two data sets, one with sales data and the other
with the map data, I will merge them together to have one data set with
all the information. In order to do this seamlessly, both data sets need
to have one column in common. In this case, both data sets have a column
with country names. I will need to change the column name in the
sales_data data set to match that in the
countries data set, which is called
name_long.
Now that both data sets have a column called name_long
that contains info on country names, I can merge them using these
columns.
Now, with all the data in one dataset, I can create a heat map where countries are filled in by a color corresponding to the number of orders they placed.
ggplot(map_density_data) +
geom_sf(mapping=aes(fill = total_orders,
geometry=geometry),
color = "gray10", stat="sf") +
scale_fill_viridis_c(name="Number of Orders",
option="plasma",
direction = -1,
na.value='grey85',
trans='log',
breaks=scales::log_breaks(n=5, base=10),
labels=scales::comma) +
ggtitle("Global Orders by Country") +
labs(caption = "Darker color = more orders")This map clearly shows which countries are responsible for placing the most orders. The color scale on the right side of the graph indicates that a darker shade means that a higher quantity of orders was placed. As a result, the United States, China, Brasil, India, and Australia are the countries with some of the highest order totals across the globe.